Experiments in Data Compression II
(or "Why couldn't I have thought of this sooner?")

by John Kormylo
E.L.F. Software


State of the Art

The following table shows the results of compressing three different files using three different methods.  'ZIP' refers to STZIP 2.5 (deflating).  'OLD' refers to the improved LZA method documented in the previous report, using a single 16K search tree.  'BEST' refers to the best results obtained previously, using three search trees and testing all possible ways to compress the next 65 characters.

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 ZIP   |        25207 |       32444 |       496880 |
 OLD   |        24942 |       34003 |       670774 |
 BEST  |        22196 |       32035 |       455017 |
-------+--------------+-------------+--------------+


Redundancy Weighting

The main inefficiency of the old LZA method is that while there can be many matches of a given length, only one can be used at a time.  No matter how heavily the entry used is weighted, there is still lost information in the other possible matches.

Using a hash table instead of a search tree could solve this problem, but hash tables are inefficient for storage (a 16K binary search tree with a maximum match length of 65 characters would require up to 1,054,960 entries in a hash table).  Also, once something is put into a hash table it can never be taken out again (at least, using the implementation documented in "The Data Compression Book" by Mark Nelson).

However, since the binary search tree has already alphabetized the "dictionary" of possible matches, it is relatively easy to use an index corresponding to an entry's position in alphabetical order.  This can be done easily by including in the node structure a count of how many nodes are in that branch (actually, the self-balancing functions already require this).  More importantly, it is also easy to locate the largest and smallest indexes corresponding to a match of a given length (and certainly faster than searching for the most heavily weighted match).  The 'weight' for this 'match' is therefore the number of matches of this length in the search tree (as it turns out, the arithmetic compression algorithm normally implements weights as a range of index values).  The length (and match indicator) codes are implemented using character codes 256 through 320, as before.

When expanding the data, one will get some value in the range of possible matches.  One would then search through the binary tree to locate the specific dictionary entry.  Given the text pointer from this node, one can then go through the search tree again and locate the largest and smallest matches for this length.

Obviously, this approach is more efficient than any possible weighting scheme involving single dictionary entries.  Also, this approach is relatively insensitive to dictionary size, since the number of matches should increase proportionately as the number of entries increases (assuming a uniformly random distribution of matches).  Other advantages are that the node structure does not need to store the weight, last match length or sort index, and that one need not sort the indexes into weight order.

The main disadvantage is that the expansion algorithm must also sort the entries into a binary search tree.  Not only does this make the expansion slower, it also requires that the compression algorithm not use information not yet coded when constructing the search tree.
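As a rough sketch of the subtree-count bookkeeping described above (hypothetical names, an illustration of the idea rather than the actual E.L.F. Software code), the node structure and the index lookup might look like this:

    /* Sketch of a search tree node carrying a count of the nodes in its
     * branch, and of how that count gives an entry's index in alphabetical
     * order.  The 'count' field is assumed to be maintained by the insertion
     * and self-balancing routines.  Hypothetical names throughout. */
    typedef struct node {
        struct node *left, *right;
        long         textptr;   /* where this dictionary entry starts in the text */
        int          count;     /* nodes in this branch, including this one */
    } NODE;

    /* Zero-based position of 'target' in alphabetical order, i.e. how many
     * entries compare less than it.  'cmp' orders two entries by the text
     * they point at. */
    static int tree_index(const NODE *root, const NODE *target,
                          int (*cmp)(const NODE *, const NODE *))
    {
        int index = 0;

        while (root != NULL) {
            int c = cmp(target, root);

            if (c < 0) {                    /* target lies in the left branch */
                root = root->left;
            } else {
                if (root->left != NULL)     /* everything on the left is smaller */
                    index += root->left->count;
                if (c == 0)
                    return index;           /* found the entry itself */
                index++;                    /* the root is smaller too */
                root = root->right;
            }
        }
        return index;                       /* where the entry would be inserted */
    }

Locating the leftmost and rightmost entries that still match the current text for the given length gives the smallest and largest such indexes; their difference plus one is the weight, which the arithmetic coder can use directly as a range of index values.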
Normally, the compression algorithm uses all 65 (or whatever the maximum match length is) future characters to place a text pointer into the tree.  Needless to say, the expansion algorithm does not have access to these future characters.

Another disadvantage is that one cannot use this approach to match future characters.  When the old LZA method ran into a sequence of repeated characters (not yet in the dictionary), it would code the first normally and code the rest by matching the text pointer just entered into the tree.  The expansion algorithm would recursively generate the entire sequence:

    for(i=0; i ...

  size  | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
--------+--------------+-------------+--------------+
        |              |             |       536383 |
    128 |        38167 |       41574 |       538023 |
    256 |        34187 |       39331 |       548685 |
    512 |        31076 |       37480 |       564429 |
   1024 |        27844 |       36567 |       581900 |
   2048 |        25690 |       35234 |       601221 |
   4096 |        24088 |       34728 |       624977 |
   8192 |        23235 |     > 34571 |       644742 |
  16384 |      > 23043 |       34693 |       660107 |
--------+--------------+-------------+--------------+


FIFO Buffer

An alternative solution is not to add match locations to the search tree until all 65 characters have been coded.  This can be implemented using a 64 entry FIFO (First In/First Out) buffer to hold these new locations.  In fact, one can use this buffer as an alternative search tree, similar to the multiple search tree methods explored in the previous report.  This should also improve performance on files with locally clustered matches.

The following table shows the results for this approach using a 64 entry tree (unweighted) and a 16K entry tree (weighted by the number of matches).  Again, using the longest match is labeled 'LONG', using the fewest bits/character is labeled 'AVG', and using the (nearly) optimal selections is labeled 'BEST'.

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24618 |       33418 |       597383 |
 AVG   |        23991 |       33269 |       555619 |
 BEST  |        21893 |       32147 |       496517 |
-------+--------------+-------------+--------------+

Both the LONG and AVG algorithms out-performed the OLD algorithm on all three files.  However, the new BEST algorithm performed better on only one of the files.

It should be noted that entries in the FIFO buffer can be flushed to the main dictionary as soon as all 65 bytes are transmitted, effectively reducing the size of the FIFO buffer.  However, as shown in the next table, this doesn't buy you much.  In fact, it is worse on two of the three files:

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24630 |       33544 |       599469 |
 AVG   |        24082 |       33466 |       540713 |
 BEST  |        22034 |       32345 |       473610 |
-------+--------------+-------------+--------------+

On the other hand, increasing the FIFO buffer to 128 entries doesn't help either.


FIFO Weighting

The reason the FIFO buffer approach works better than backwards sorting is that the statistical assumption of a uniform random distribution of matches is not valid for some files.  In this case the probability of finding a match in the 64 entry FIFO buffer is much greater than the number of equivalent matches in the main dictionary divided by the total number of entries.

That being the case, it would also seem reasonable that the probability of finding a match in the most recent half of the FIFO buffer is greater than in the older half, and so on.  One could assign a weight to each entry in the FIFO buffer based on how long it has been there, and use those weights when coding the index.  These weights would, of course, consist of (scaled) counts of how many times this 'lag' has been used (initially set to 1).
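As a rough illustration of this lag weighting (hypothetical names, not the actual E.L.F. Software code), the counts might be kept and used along the following lines:

    /* Sketch of lag weighting for the 64 entry FIFO buffer.  lag_count[k]
     * counts how often a match has been coded at lag k (1 = most recently
     * added entry); all counts start at 1 so every lag keeps a nonzero
     * probability.  Hypothetical names throughout. */
    #define FIFO_SIZE 64

    static long lag_count[FIFO_SIZE + 1];   /* entries 1..FIFO_SIZE used */
    static long lag_total;

    void init_lag_weights(void)
    {
        int k;
        for (k = 1; k <= FIFO_SIZE; k++)
            lag_count[k] = 1;
        lag_total = FIFO_SIZE;
    }

    /* Range handed to the arithmetic coder for a match at 'lag':
     * [*low, *low + *range) out of *total index values. */
    void lag_range(int lag, long *low, long *range, long *total)
    {
        int k;
        long sum = 0;
        for (k = 1; k < lag; k++)
            sum += lag_count[k];
        *low   = sum;
        *range = lag_count[lag];
        *total = lag_total;
    }

    /* After coding a match at 'lag', bump its count; halve the counts now
     * and then so they stay 'scaled'.  The rescaling threshold here is an
     * arbitrary choice for the sketch. */
    void update_lag(int lag)
    {
        lag_count[lag]++;
        if (++lag_total > 16384) {
            int k;
            lag_total = 0;
            for (k = 1; k <= FIFO_SIZE; k++) {
                lag_count[k] = (lag_count[k] + 1) / 2;   /* never below 1 */
                lag_total += lag_count[k];
            }
        }
    }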
Since there are only 64 entries, this should not impose much additional computation cost.  The next table shows the results of using these weights for the FIFO buffer:

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24620 |       33375 |       583218 |
 AVG   |        23991 |       33207 |       530798 |
 BEST  |        22022 |       32088 |       461457 |
-------+--------------+-------------+--------------+

As expected, these results were not quite as good for DEMONSTR.DOC, but better for the other two files.  A list of the weights generated using the 'AVG' approach is shown in the Appendix.  (It should be pointed out that this is not, strictly speaking, a 'lag' in the time series sense, since each increment can mean anywhere from 1 to 65 bytes.)


Break Even

Another advantage of redundancy weighting is that it does a much better job for smaller matches.  (One could theoretically use it for single characters, except that you would still need some way to get the dictionary started.)  However, early testing indicated that the best results were obtained by never using matches of length 2 and always using matches of length 3 (as opposed to testing for break even, as was done in the 'OLD' LZA algorithm; a sketch of this sort of test follows the Appendix).  The next table shows the results of using length 2 matches with FIFO weighting:

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
 LONG  |        24860 |       34719 |       585185 |
 AVG   |        24054 |       34394 |       520875 |
 BEST  |        22029 |       31997 |       453143 |
-------+--------------+-------------+--------------+

Comparing this to the previous table, we see that including length 2 matches only works well with the 'BEST' algorithm (which implicitly contains a break even test).  More importantly, this allows the new 'BEST' to beat the old 'BEST' (in the first table) on all three files.


Appendix - Normalized Weights (count / total * 1000)

  lag  | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
     1 |            3 |          21 |           12 |
     2 |            5 |          63 |          251 |
     3 |           14 |          23 |          109 |
     4 |            8 |          66 |           88 |
     5 |           18 |          33 |           59 |
     6 |           24 |          37 |           48 |
     7 |           13 |          24 |           35 |
     8 |           19 |          42 |           31 |
     9 |           21 |          28 |           26 |
    10 |            8 |          35 |           23 |
    11 |           21 |          22 |           19 |
    12 |           20 |          26 |           16 |
    13 |           33 |          20 |           15 |
    14 |           24 |          24 |           13 |
    15 |           21 |          18 |           13 |
    16 |           19 |          17 |           12 |
    17 |           26 |          18 |           10 |
    18 |           25 |          16 |           10 |
    19 |            9 |          18 |           10 |
    20 |           25 |          15 |            9 |
    21 |           15 |          18 |            8 |
    22 |           17 |          17 |            8 |
    23 |           14 |          18 |            7 |
    24 |           28 |          15 |            7 |
    25 |           22 |          13 |            7 |
    26 |           17 |          13 |            6 |
    27 |           23 |          20 |            6 |
    28 |           16 |          15 |            6 |
    29 |           17 |          14 |            6 |
    30 |           17 |           9 |            5 |
    31 |           18 |          14 |            6 |
    32 |           14 |          11 |            5 |
    33 |           34 |          11 |            5 |
    34 |           12 |           8 |            5 |
    35 |           26 |          12 |            5 |
    36 |           13 |          10 |            4 |
    37 |           17 |           9 |            5 |
    38 |           17 |          13 |            5 |
    39 |           20 |           7 |            4 |
    40 |           13 |          10 |            4 |
    41 |           16 |          10 |            4 |
    42 |           11 |           9 |            4 |
    43 |            9 |           7 |            4 |
    44 |            9 |           8 |            4 |
    45 |           15 |           3 |            4 |
    46 |           13 |           4 |            3 |
    47 |           18 |           8 |            3 |
    48 |            7 |           9 |            3 |
    49 |           14 |           5 |            3 |
    50 |            9 |           9 |            3 |
    51 |           15 |           9 |            3 |
    52 |           14 |           5 |            3 |
    53 |           20 |           8 |            3 |
    54 |            9 |           6 |            3 |
    55 |            8 |           6 |            3 |
    56 |            8 |           9 |            3 |
    57 |           13 |           5 |            3 |
    58 |           10 |          11 |            3 |
    59 |            4 |          13 |            3 |
    60 |           15 |           9 |            2 |
    61 |           12 |           4 |            2 |
    62 |            7 |           9 |            3 |
    63 |            8 |           8 |            3 |
    64 |            9 |           7 |            2 |
-------+--------------+-------------+--------------+
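Finally, here is a sketch of the kind of break even test referred to in the Break Even section (hypothetical names and placeholder bit costs, not the actual 'OLD' LZA code):

    /* Sketch of a break even test for a candidate match.  The bit costs
     * below are crude placeholders; a real coder would take them from its
     * current counts.  Hypothetical names throughout. */
    static double bits_for_char(void)   { return 8.0; }  /* one literal character */
    static double bits_for_length(void) { return 9.0; }  /* length/match code     */
    static double bits_for_index(void)  { return 6.0; }  /* e.g. a 64 entry FIFO  */

    /* Nonzero if coding 'length' characters as a single match is expected
     * to take fewer bits than coding them one character at a time. */
    static int worth_matching(int length)
    {
        double match_bits   = bits_for_length() + bits_for_index();
        double literal_bits = length * bits_for_char();
        return match_bits < literal_bits;
    }

In the 'BEST' algorithm no separate test of this kind is needed, since comparing complete codings of the next 65 characters already accounts for these costs.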